Allow more than 2^32 sequences to be clustered #1039

martin-steinegger · 2025-09-29T03:57:32Z

No description provided.

yanyanshao · 2025-11-21T04:02:53Z

This is a very impressive and crucial enhancement for large-scale clustering.

We are currently facing a project that requires clustering ~11 billion (1.1e10) protein sequences.

Could you please advise if there is a version of MMseqs2 (like a branch from this PR) that is already capable of handling a dataset of this scale?

If a single run is not yet feasible, what would be the recommended strategy? For example, is the "split-cluster-merge" approach the best practice? Have you conducted any scalability tests or benchmarks for clustering at this unprecedented scale (e.g., tens of billions of sequences)?

Any guidance or insights from you would be immensely helpful for our work. Thank you for developing and continuously improving this fantastic tool!

milot-mirdita · 2025-11-21T04:06:09Z

This PR is still in active development and is not production-ready. Our current recommendation is still to split the database into chunks of 2-3 billion sequences and cluster each chunk separately. Afterwards, merge the resulting chunks until the merged set reaches 2-3 billion sequences again, cluster it, and repeat until everything is done.

We are of course interested in getting native support into MMseqs2 for this, but this might still take a bit.
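
To get a feel for how many rounds the split-and-merge recommendation above implies, here is a rough planning sketch in Python. It is not part of MMseqs2 and does not call it. The `reduction` parameter (the fraction of sequences that survive as cluster representatives after each run) is purely an assumption for illustration; the real value depends on the data and the clustering thresholds. The function name `plan_rounds` is hypothetical.

```python
import math

UINT32_LIMIT = 2**32  # 4,294,967,296: the sequence-count ceiling named in the PR title


def plan_rounds(total, chunk_size, reduction):
    """Estimate (sequences_in, chunks) for each split-cluster-merge round.

    total      -- number of input sequences
    chunk_size -- maximum number of sequences clustered in a single run
    reduction  -- ASSUMED fraction of sequences kept as representatives
                  after each run (0 < reduction < 1); not a measured value
    """
    if not 0 < reduction < 1:
        raise ValueError("reduction must be strictly between 0 and 1")
    if not 0 < chunk_size < UINT32_LIMIT:
        raise ValueError("chunk_size must be positive and below 2^32")

    rounds = []
    remaining = total
    while remaining > chunk_size:
        # Split what is left into chunks that each fit in one run.
        chunks = math.ceil(remaining / chunk_size)
        rounds.append((remaining, chunks))
        # Only the representatives are carried into the next round.
        remaining = math.ceil(remaining * reduction)
    # Whatever is left now fits in a single final run.
    rounds.append((remaining, 1))
    return rounds


if __name__ == "__main__":
    # ~11 billion sequences, 2.5 billion per chunk, assumed 50% reduction per round.
    for sequences_in, chunks in plan_rounds(11_000_000_000, 2_500_000_000, 0.5):
        print(f"{sequences_in:>14,d} sequences -> {chunks} chunk(s)")
```

With these assumed numbers, the ~11 billion sequences mentioned above would need five chunks in the first round, since 11 billion is well above 2^32 and above the suggested 2-3 billion per chunk. A less aggressive reduction means more rounds.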

milot-mirdita · 2025-11-21T04:06:35Z

We have clustered ~100B with the split and merge strategy before.

yanyanshao · 2025-11-21T05:53:54Z

@milot-mirdita
Thanks a lot for the detailed recommendation! I’ve got a follow-up question: after splitting the database into chunks and clustering each one individually, could you share the specific steps for merging these separate clustering results (before we re-cluster the combined set once it hits 2-3 billion sequences again)? I’d really appreciate some concrete guidance here.

yanyanshao · 2025-11-21T06:16:57Z

@milot-mirdita, could you share a step-by-step example (including commands) of this split-cluster-merge workflow in MMseqs2?
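
No commands were posted in this thread, so the following is only a conceptual sketch of the bookkeeping behind the merge step, written in plain Python rather than as MMseqs2 commands. It assumes each clustering result can be read as a mapping from member id to representative id; the function names and the toy ids are hypothetical. The idea: per-chunk clusterings map members to chunk-level representatives, a later round maps those representatives to final representatives, and composing the two gives the final assignment for every original sequence.

```python
def compose_clusterings(first, second):
    """Chain two clustering rounds into one member -> final representative map.

    first  -- member id -> representative id from the per-chunk clusterings
    second -- representative id -> final representative id from the next round
    """
    merged = {}
    for member, rep in first.items():
        # A representative missing from the second round stays its own representative.
        merged[member] = second.get(rep, rep)
    return merged


def group_by_representative(assignments):
    """Turn a member -> representative map into representative -> members."""
    clusters = {}
    for member, rep in assignments.items():
        clusters.setdefault(rep, []).append(member)
    return clusters


if __name__ == "__main__":
    # Toy ids only: two chunks clustered separately.
    chunk_a = {"a1": "a1", "a2": "a1", "a3": "a3"}
    chunk_b = {"b1": "b1", "b2": "b1"}
    first_round = {**chunk_a, **chunk_b}

    # Second round over the representatives a1, a3, b1: b1 joins a1's cluster.
    second_round = {"a1": "a1", "a3": "a3", "b1": "a1"}

    final = compose_clusterings(first_round, second_round)
    print(group_by_representative(final))
```

In the toy example, `b1` and `b2` end up in the cluster represented by `a1` even though they were never compared to `a2` directly, which is the usual trade-off of cascaded, representative-based clustering.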

martin-steinegger added 11 commits September 7, 2025 16:53

Regression working, switch to IdType

b335a4a

size_t compiles and regression runs through

05182d2

Rewrite kmermatcher and clustering, regression works

cf25e9d

Adjust more, regression is still green

f542b4c

Resolve more issues

609e6ed

Fix more, regression works

1fd06fc

More complicated rewrites, regression okay

b8524a8

One more fixed

16df662

Orf works

6879317

More rewrites

5e9a30c

More changes

b135cba
